Abstract. We study the use of sampling for efficiently mining the top-K
frequent itemsets of cardinality at most w. To this purpose, we define an 
approximation to the top-K frequent itemsets to be a family of itemsets
which includes (resp., excludes) all very frequent (resp., very infrequent) 
itemsets, together with an estimate of these itemsets' frequencies with 
a bounded error. Our first result is an upper bound on the sample size 
which guarantees that the top-K frequent itemsets mined from a random
sample of that size approximate the actual top-K frequent itemsets, with 
probability larger than a specified value. We show that the upper bound 
is asymptotically tight when w is constant. 

Our main algorithmic contribution is a progressive sampling approach, 
combined with suitable stopping conditions, which on appropriate inputs 
is able to extract approximate top-K frequent itemsets from samples 
whose sizes are smaller than the general upper bound. In order to test 
the stopping conditions, this approach maintains the frequency of all 
itemsets encountered, which is practical only for small w. However, we 
show how this problem can be mitigated by using a variation of Bloom 
filters. 

A number of experiments conducted on both synthetic and real benchmark
datasets show that samples substantially smaller than the
original dataset (i.e., of size defined by the upper bound or reached
through the progressive sampling approach) suffice to approximate the
actual top-K frequent itemsets with accuracy much higher than what
is analytically proved.

1 Introduction 

For a dataset D of transactions over an alphabet of items I, the top-K frequent
itemsets is the family of subsets of items, dubbed itemsets, which occur in D with
frequency greater than or equal to the frequency of the K-th most frequent one.
The extraction of the top-K frequent itemsets is a fundamental primitive which 
arises directly in applications from many different domains (e.g., data mining, 
networking, and databases), and it is also regarded as a convenient alternative 
to the classical data mining problem of computing all itemsets with frequency 
above a fixed threshold, since the parameter K allows, in practice, for a better 
control on the output size. 

For large datasets the mining of top-K frequent itemsets becomes computationally
challenging. In particular, exact algorithms must scan the entire dataset
more than once and if the dataset does not fit completely in main memory, as 
is typically the case, disk accesses may slow down the computation to a point 
where it becomes impractical. In this paper we consider the use of sampling for 
efficiently mining a suitable approximation to the top-K frequent itemsets from 
large datasets, and investigate the trade-offs between the sample size and the 
accuracy of the approximation. 

1.1 Previous work 

A few attempts have been made to devise efficient exact algorithms for min- 
ing top-K frequent itemsets from static datasets, restricting the mining task to 
closed itemsets (i.e., itemsets that have no superset with the same frequency) 
[14, 11]. However, these algorithms exhibit limited scalability and are unable to 
efficiently handle extremely large inputs. 

A number of works in the literature explored the use of sampling in the 
context of the classical frequent itemsets mining problem where all itemsets with 
frequency above a fixed threshold are required [16, 6, 12, 5, 10, 2]. The idea is to 
extract frequent itemsets from a sample of transactions randomly drawn from the 
input dataset, where the size of the sample is chosen, possibly much smaller than 
the original dataset size, so that the frequent itemsets with respect to the sample 
represent a good approximation to the actual frequent itemsets. Upper bounds 
on the sample size which guarantee to estimate the frequency of an itemset 
with a given accuracy and a given confidence are presented in [16,6]. However, 
these estimates are derived for the individual itemsets identified by the algorithm 
and do not bound the probability of not discovering other itemsets of equal or 
larger frequencies. Ensuring with high probability that all itemsets sought by the 
mining task be discovered is the main challenge in applying sampling to mining. 

In [12] the author proposes an algorithm that, by mining a random sample 
of the dataset, builds a candidate set of frequent itemsets which contains all the 
frequent itemsets with a probability that depends on the sample size. However, 
the sample does not guarantee that all itemsets in the candidate set are frequent. 
Nevertheless, the set of candidates allows the algorithm to efficiently identify the 
set of frequent itemsets with at most two passes on the entire dataset. 

The use of progressive sampling for mining frequent itemsets has been studied 
in [5, 10]. Progressive sampling entails analyzing increasingly larger samples until 
the observed improvement of a certain measure of the accuracy of the sample 
with respect to the mining task falls below a specified threshold. An empirical
two-phase sampling method was devised in [2] which first uses a large sample to
estimate the item frequencies accurately, and then uses this information to build, 
progressively, a suitable sample for mining the frequent itemsets. We remark 
that the aforementioned works do not assess analytically the relation between 
the sample size and the accuracy of the approximation to the frequent itemsets 
gathered from the sample. 

The extraction of top-K frequent itemsets through sampling is a more challenging
task, since the corresponding minimum frequency threshold is not known
in advance. In a recent work [13] sampling is used for extracting, for a fixed
parameter $0 < \varepsilon < 1$, an approximation of the top-K frequent items from a
sequence of items, which contains no item whose actual frequency is less than
$f_K - \varepsilon$, where $f_K$ is the actual frequency of the K-th most frequent item. However,
to achieve this result the sample size is defined as a function of $f_K$, which
is unknown. The authors propose an empirical sequential method to estimate
the right sample size. Moreover, the results cannot be directly extended to the
mining of top-K frequent item(set)s from datasets of transactions.

A number of recent papers considered the problem of finding (top-K) frequent 
items/itemsets from data streams [4, 7, 1, 8, 15, 3]. In the streaming context, the
entire dataset is scanned (at least) once and sampling is employed to maintain 
summary data structures, with small memory footprints, which provide approxi- 
mate solutions to the mining task. The accuracy of the approximation is related 
to the size of the data structures rather than the actual number of elements 
sampled from the input dataset. In this respect, the objective of these works is 
slightly different from the one pursued in this paper. 

1.2 Our contribution 

We present novel results on the effectiveness of sampling for mining top-K fre- 
quent itemsets from a transactional dataset. In analogy to [13], we define an 
$\varepsilon$-approximation to the top-K frequent itemsets to be a family of itemsets, together
with their estimated frequencies, which excludes all itemsets whose actual
frequency is less than $f_K - \varepsilon$ and includes all itemsets whose actual frequency
is at least $f_K + \varepsilon$, where $f_K$ is the actual frequency of the K-th most frequent
itemset. (Note that the latter requirement was not imposed in [13].) Moreover,
the estimated frequency of each itemset in the $\varepsilon$-approximation must be at most
an additive factor $\varepsilon$ away from the actual one. We also impose a bound w on
the size of the frequent itemsets to be discovered.

We prove an upper bound t on the sample size which guarantees, with probability
at least $1-\delta$, that an $\varepsilon$-approximation to the top-K frequent itemsets of
size up to w is discovered. The bound is a function of the parameters K, $\varepsilon$, $\delta$,
and of the total number of itemsets of size at most w, but it does not depend
on the number of transactions in the dataset or on $f_K$. We also show that for
$w \in O(1)$ the upper bound t is tight, within constant factors, by arguing the
existence of a dataset for which a random sample of size $o(t)$ would not provide
the required $\varepsilon$-approximation with sufficiently high probability.



While the above bounds are tight for worst-case datasets, an adaptive approach
may be able to compute an $\varepsilon$-approximation to top-K frequent itemsets
with smaller samples in some cases. To this purpose, we devise a progressive
sampling approach which extracts the top-K frequent itemsets from increasingly
larger samples until suitable stopping conditions are met or the upper bound t
is hit. A straightforward implementation of our approach requires that in order 
to test the stopping conditions, the frequencies of all itemsets encountered be 
maintained, which is practical only if the upper bound w on the itemset size is 
small. However, we show how this problem can be mitigated by using count-min 
filters, which are a variation of Bloom filters. 

A number of experiments conducted on both synthetic and real benchmark
datasets show that the family of top-K frequent itemsets mined either from
a random sample of t transactions or from the smaller samples built with the
progressive sampling approach approximates the actual top-K frequent itemsets
with an accuracy and confidence much higher than what is analytically proved.
This provides further evidence of the effectiveness of the sampling approach
for the mining task under consideration. Moreover, we show analytically on a
specific artificial dataset and experimentally on a benchmark, that the stopping 
conditions employed by the progressive sampling approach are able to stop the 
sampling schedule before the sample size t is reached, while maintaining the 
same accuracy and confidence bounds on the output. 

The rest of the paper is organized as follows. Section 2 formally defines the
notion of $\varepsilon$-approximation to the top-K frequent itemsets. It also introduces
some notation and a technical fact which will be used in the analysis. In Section 3
we present the bound t on the sample size sufficient for performing the mining
task with the specified confidence and show that it is tight when itemsets of
size at most $w = O(1)$ are sought. Section 4 presents the progressive sampling
algorithm. A more efficient implementation of the algorithm through count-min 
filters is outlined in Section 5. Section 6 reports the results of the evaluation of 
our approach. Section 7 closes the paper with some final remarks. 

2 Preliminaries 

Consider a dataset $\mathcal{D}$ of transactions, where each transaction $\tau$ is a subset of a
universe $\mathcal{I}$ of n items. Let $|\tau|$ denote the number of items in transaction $\tau$. Given
an itemset $x \in 2^{\mathcal{I}}$, we use $f_{\mathcal{D}}(x)$ (resp., $\sigma_{\mathcal{D}}(x)$) to denote its frequency (resp.,
support) in $\mathcal{D}$, namely, the fraction (resp., number) of transactions containing
x. We consider only itemsets of size at most w, and denote with $U(\mathcal{I}, w)$ the
complete family of these itemsets. We define $m = |U(\mathcal{I},w)| = \sum_{i=1}^{w} \binom{n}{i}$.

For convenience, we assume a fixed canonical ordering of the itemsets in
$U(\mathcal{I}, w)$ by decreasing frequency in $\mathcal{D}$, with ties broken arbitrarily, and label the
itemsets $x_1, x_2, \ldots, x_m$ according to this ordering. For a given K, with $1 \le K \le m$,
we denote $f_{\mathcal{D}}^{(K)} = f_{\mathcal{D}}(x_K)$, and define the set of top-K frequent itemsets of
size at most w (with their respective frequencies) as

$$\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w) = \left\{ (x, f_{\mathcal{D}}(x)) : x \in U(\mathcal{I},w),\ f_{\mathcal{D}}(x) \ge f_{\mathcal{D}}^{(K)} \right\}.$$
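As an illustration of the definition above, the following is a minimal brute-force sketch in Python. It is not the mining algorithm studied in this paper: it enumerates every itemset of size at most w in every transaction, which is practical only for tiny inputs. The function name `top_k_itemsets` and the representation of transactions as Python sets are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def top_k_itemsets(dataset, k, w):
    """Return {itemset: frequency} for every itemset of size <= w whose
    frequency is at least that of the k-th most frequent itemset.

    Ties at the k-th frequency are all kept, so the result may contain
    more than k itemsets. Itemsets that never occur are not enumerated;
    if k exceeds the number of observed itemsets, all of them are returned.
    """
    counts = Counter()
    for transaction in dataset:
        items = sorted(set(transaction))
        for size in range(1, min(w, len(items)) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    if not counts:
        return {}
    ranked = sorted(counts.values(), reverse=True)
    threshold = ranked[min(k, len(ranked)) - 1]  # support of the k-th itemset
    n = len(dataset)
    return {x: c / n for x, c in counts.items() if c >= threshold}
```

For example, on the four transactions {a,b}, {a,b,c}, {a}, {b,c} with w = 2, the call with K = 2 returns the items a and b (frequency 0.75 each), while K = 3 returns five itemsets because of ties at the third position.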



In this work we aim at efficiently mining the following approximation to the 
set $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$.



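Definition 1. Let $\varepsilon \in (0,1)$ be a real-valued parameter. An $\varepsilon$-approximation
to $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$ is a set W of K or more ordered pairs $(x, f)$ such that
$x \in U(\mathcal{I},w)$, $f \in [0,1]$, and for which the following properties hold:

P1 for each $(x, f) \in W$, $f_{\mathcal{D}}(x) \ge f_{\mathcal{D}}^{(K)} - \varepsilon$;
P2 for each $(x, f) \notin W$, $f_{\mathcal{D}}(x) < f_{\mathcal{D}}^{(K)} + \varepsilon$;
P3 for each $(x, f) \in W$, $|f - f_{\mathcal{D}}(x)| \le \varepsilon$.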
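The three properties of Definition 1 can be checked mechanically when the actual frequencies are known, which is useful when validating a sampling run on a small dataset. The following Python sketch is an illustration only; the name `is_eps_approximation` and the dictionary-based representation are assumptions, and `true_freq` is assumed to list the actual frequency of every itemset of U(I, w) (missing itemsets are treated as having frequency 0).

```python
def is_eps_approximation(W, true_freq, k, eps):
    """Check properties P1, P2 and P3 of Definition 1.

    W         : dict mapping itemset -> estimated frequency
    true_freq : dict mapping itemset -> actual frequency in the dataset
    """
    if len(W) < k or len(true_freq) < k:
        return False
    # Actual frequency of the k-th most frequent itemset.
    f_k = sorted(true_freq.values(), reverse=True)[k - 1]
    # P1: nothing very infrequent is reported.
    p1 = all(true_freq.get(x, 0.0) >= f_k - eps for x in W)
    # P2: nothing very frequent is left out.
    p2 = all(f < f_k + eps for x, f in true_freq.items() if x not in W)
    # P3: every reported estimate is within eps of the actual frequency.
    p3 = all(abs(f - true_freq.get(x, 0.0)) <= eps for x, f in W.items())
    return p1 and p2 and p3
```

Note that P2 allows an actual top-K itemset to be missing from W as long as its frequency is below the K-th frequency plus $\varepsilon$.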



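Given a sample $\mathcal{S}$, we will use $f_{\mathcal{S}}(x)$ to denote the frequency of an itemset x
in $\mathcal{S}$, and $\sigma_{\mathcal{S}}(x)$ to denote the support of x in $\mathcal{S}$. The following fact will be used
later in the analysis. Due to space constraints, its proof is omitted.

Fact 1 Consider a sample $\mathcal{S}$ of $t > 0$ transactions drawn at random with replacement
from $\mathcal{D}$. Let $x \in 2^{\mathcal{I}}$ be an arbitrary itemset. For fixed $\varepsilon \in (0,1)$ and
for any itemset $y \in 2^{\mathcal{I}}$ such that $f_{\mathcal{D}}(x) \ge f_{\mathcal{D}}(y) + \varepsilon$ we have:

$$\Pr\left(f_{\mathcal{S}}(y) \ge f_{\mathcal{S}}(x)\right) \le e^{-t\varepsilon^2/2} \quad \text{and} \tag{1}$$

$$\Pr\left(|f_{\mathcal{S}}(x) - f_{\mathcal{D}}(x)| \ge \varepsilon\right) \le 2e^{-t\varepsilon^2/2}. \tag{2}$$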
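A quick simulation can make Relation (2) concrete. The sketch below assumes that the bound reads $2e^{-t\varepsilon^2/2}$, which is the form consistent with the sample size used in Theorem 1, and models a single itemset of actual frequency p by independent Bernoulli draws (sampling with replacement). The function names and the parameter values are illustrative assumptions.

```python
import math
import random


def deviation_rate(p, t, eps, trials=2000, seed=0):
    """Empirical probability that the sample frequency of an itemset with
    actual frequency p deviates from p by at least eps, over `trials`
    independent samples of t transactions drawn with replacement."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        hits = sum(rng.random() < p for _ in range(t))
        if abs(hits / t - p) >= eps:
            bad += 1
    return bad / trials


def relation2_bound(t, eps):
    """Right-hand side of Relation (2), as assumed in this sketch."""
    return 2 * math.exp(-t * eps ** 2 / 2)


if __name__ == "__main__":
    print(deviation_rate(0.3, 500, 0.1), relation2_bound(500, 0.1))
```

In such runs the empirical rate is typically far below the bound, which matches the observation made later in the paper that the analytical guarantees are conservative.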



3 Upper bound on the sample size 

Consider a sample $\mathcal{S}$ of t transactions drawn at random with replacement from
$\mathcal{D}$. Theorem 1 below shows that if t is large enough then the set of top-K
frequent itemsets from $\mathcal{S}$, with their respective frequencies in the sample, yields
an $\varepsilon$-approximation to $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$, with a certain probability.

Theorem 1. For fixed $\varepsilon, \delta \in (0,1)$, let $\mathcal{S}$ be a sample of

$$t = \frac{2}{\varepsilon^2} \ln \frac{2m + K(m-K)}{\delta}$$

transactions drawn at random with replacement from $\mathcal{D}$. Then, $W =
\mathrm{TOPK}(\mathcal{S},\mathcal{I},K,w)$ is an $\varepsilon$-approximation to $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$ with probability
at least $1 - \delta$.
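The sample size of Theorem 1 is easy to evaluate numerically. The sketch below assumes the bound $t = (2/\varepsilon^2)\ln((2m + K(m-K))/\delta)$ with $m = \sum_{i=1}^{w}\binom{n}{i}$, and rounds up to an integer number of transactions; the function names are assumptions made for the example.

```python
import math


def num_itemsets(n, w):
    """m = |U(I, w)|: number of non-empty itemsets of size at most w over n items."""
    return sum(math.comb(n, i) for i in range(1, w + 1))


def sample_size_bound(n, w, k, eps, delta):
    """Sample size of Theorem 1, rounded up to an integer."""
    m = num_itemsets(n, w)
    return math.ceil((2 / eps ** 2) * math.log((2 * m + k * (m - k)) / delta))


if __name__ == "__main__":
    # Illustrative parameters only: 1,000 items, single items (w = 1).
    print(sample_size_bound(n=1000, w=1, k=10, eps=0.02, delta=0.1))
```

The bound grows only logarithmically in m and K but quadratically in $1/\varepsilon$, so halving $\varepsilon$ roughly quadruples the sample size.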

Proof. Consider the m itemsets of $U(\mathcal{I},w)$, namely $x_1, x_2, \ldots, x_m$, indexed according
to the canonical ordering introduced before, and define the following
property Q: "for every i, j, with $1 \le i \le K < j \le m$, such that
$f_{\mathcal{D}}(x_i) \ge f_{\mathcal{D}}(x_j) + \varepsilon$ we have $f_{\mathcal{S}}(x_i) > f_{\mathcal{S}}(x_j)$". We first show that if property
Q holds then properties P1 and P2 of Definition 1 hold for W. Assume
that Q holds. Then, the frequency in $\mathcal{S}$ of any $x_i$, with $1 \le i \le K$, is larger
than the frequency in $\mathcal{S}$ of any itemset $x_j$ with $f_{\mathcal{D}}(x_j) < f_{\mathcal{D}}^{(K)} - \varepsilon$, and this
implies Property P1. As for P2, consider a pair $(x, f_{\mathcal{S}}(x)) \notin W$ and suppose,
by contradiction, that $f_{\mathcal{D}}(x) \ge f_{\mathcal{D}}^{(K)} + \varepsilon$, that is $x = x_i$, for some index i with
$1 \le i \le K$. If this is the case, since W contains at least K pairs, it must contain
a pair $(x_j, f_{\mathcal{S}}(x_j))$ with $j > K$ and $f_{\mathcal{S}}(x_j) \ge f_{\mathcal{S}}(x_i)$. This is impossible because
Q holds and $f_{\mathcal{D}}(x_i) \ge f_{\mathcal{D}}(x_j) + \varepsilon$.

We complete the proof by showing that if the sample size t is chosen as
stated, then, with probability at least $1 - \delta$, both Q and P3 (from Definition 1)
hold. Consider a pair of itemsets $(x_i, x_j)$ with $1 \le i \le K < j \le m$ such that
$f_{\mathcal{D}}(x_i) \ge f_{\mathcal{D}}(x_j) + \varepsilon$. By Fact 1 we have $\Pr(f_{\mathcal{S}}(x_j) \ge f_{\mathcal{S}}(x_i)) \le e^{-t\varepsilon^2/2}$. Also,
from Relation (2) we have that for any itemset $x_i$, with $1 \le i \le m$,
$\Pr(|f_{\mathcal{S}}(x_i) - f_{\mathcal{D}}(x_i)| \ge \varepsilon) \le 2e^{-t\varepsilon^2/2}$. Note that at most $K(m-K)$ pairs are involved in Q and at most
m itemsets are involved in property P3. Therefore, by applying a union bound,
the probability that Q or P3 do not hold is at most $(2m + K(m-K))e^{-t\varepsilon^2/2} \le \delta$,
and the theorem follows. □

Note that t is independent of the number of transactions in $\mathcal{D}$ and of the
frequencies of the itemsets in $\mathcal{D}$.

We now show that for $K, w \in O(1)$ and constant $\varepsilon$, if we fix the confidence
parameter $\delta$ suitably small, yet constant, then for a random sample $\mathcal{S}$ of size
$t \in o(\ln m) = o((1/\varepsilon^2)\ln(m/\delta))$ the probability that $\mathrm{TOPK}(\mathcal{S},\mathcal{I},K,w)$ is an $\varepsilon$-approximation
to $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$ can be made smaller than $1 - \delta$ by choosing
the number of items and the number of transactions in $\mathcal{D}$ large enough. Consider
a universe $\mathcal{I}(\ell)$ of $K + \ell$ items, namely $\mathcal{I}(\ell) = \{y_1, \ldots, y_K\} \cup \{x_1, \ldots, x_\ell\}$, where
K is fixed and $\ell$ is a parameter.

Theorem 2. Let $K, w \in O(1)$ and consider an arbitrary constant $\varepsilon \in (0, 1/4)$.
Fix a constant $\delta < 1 - (1/2)^K$. Let $t(x)$ be any integer-valued function such
that $t(x) \in o(\ln x)$. Then, for large enough $\ell$, there exists a dataset $\mathcal{D}^*$ over
$\mathcal{I}(\ell)$ such that for a random sample $\mathcal{S}$ of $t(\ell)$ transactions, the probability that
$\mathrm{TOPK}(\mathcal{S},\mathcal{I},K,w)$ is an $\varepsilon$-approximation to $\mathrm{TOPK}(\mathcal{D}^*,\mathcal{I},K,w)$ is less than $1 - \delta$.

Proof. Fix $p_K \in (0,1)$ such that $p_K > 2\varepsilon$ and $p_K - p_K^2 > \varepsilon$, and let $p_\ell =
p_K - 2\varepsilon$. Consider a random dataset $\mathcal{D}$ of N transactions over $\mathcal{I}(\ell)$, where in
any transaction each item $y_i$ (resp., $x_j$) is included with probability $p_K$ (resp.,
$p_\ell$) independently of all other items and all other transactions. Hence, $p_K$ is the
expected frequency of each $y_i$, and $p_\ell$ the expected frequency of each $x_j$. Let p be
the minimum frequency in $\mathcal{D}$ of the items $y_1, \ldots, y_K$. It can be easily shown that,
by the choice of $p_K$, if N is large enough the event F = "$\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w) =
\{y_1, \ldots, y_K\}$ and all other itemsets have frequency smaller than $p - \varepsilon$" holds
with probability $1 - o(1)$. Thus the only itemsets that can be reported in an
$\varepsilon$-approximation are the items $\{y_1, \ldots, y_K\}$. For a sample $\mathcal{S}$, define the events
$E_1$ = "at least one of $x_1, \ldots, x_\ell$ appears in $\mathcal{S}$ with frequency $\ge p_K$" and $E_2$ =
"at least one of $y_1, \ldots, y_K$ appears in $\mathcal{S}$ with frequency $< p_K$". When F, $E_1$ and
$E_2$ occur, $\mathrm{TOPK}(\mathcal{S},\mathcal{I},K,w)$ does not satisfy property P1.

In what follows we prove that if $\mathcal{S}$ is a random sample of $t(\ell) = o(\ln \ell) =
o(\ln m)$ transactions from a dataset $\mathcal{D}$ built as above, then $\Pr(F \cap E_1 \cap E_2) \ge
1 - o(1) - (1/2)^K$. This implies that $\Pr(E_1 \cap E_2 \mid F) \ge 1 - o(1) - (1/2)^K$, and
thus there must exist a dataset $\mathcal{D}^*$ for which the event F holds and such that
if $\mathcal{S}$ is a random sample of $t(\ell)$ transactions then $\Pr(\mathrm{TOPK}(\mathcal{S},\mathcal{I},K,w)$ is an $\varepsilon$-approximation
to $\mathrm{TOPK}(\mathcal{D}^*,\mathcal{I},K,w)) \le (1/2)^K + o(1)$. Hence, this probability
can be made smaller than $1 - \delta$ by choosing N and $\ell$ sufficiently large.

Since we already argued that for N large enough event F occurs with probability
at least $1 - o(1)$, it is sufficient to prove that $\Pr(E_1 \cap E_2) \ge 1 - o(1) - (1/2)^K$.

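We first show that if $\mathcal{S}$ has size $t(\ell)$ then $\Pr(E_1) = 1 - o(1)$. Let $X_i = 1$ if
$f_{\mathcal{S}}(x_i) \ge p_K$ and $X_i = 0$ otherwise. Then $X = \sum_{i=1}^{\ell} X_i$ is a random variable
counting how many of the items $x_1, \ldots, x_\ell$ appear in $\mathcal{S}$ with frequency at least
$p_K$. Thus $\Pr(E_1) = \Pr(X \ge 1)$. We have^1

$$\Pr(X_i = 1) \ge \binom{t}{t p_K} p_\ell^{t p_K} (1 - p_\ell)^{t - t p_K} \ge \left(\frac{p_\ell}{p_K}\right)^{t p_K} (1 - p_\ell)^{t - t p_K}.$$

Let $d > 1$ be a constant such that $1/d = \min\{p_\ell/p_K, 1 - p_\ell\}$. Then
$(p_\ell/p_K)^{t p_K} (1 - p_\ell)^{t - t p_K} \ge d^{-t}$. We then have $\Pr(X = 0) \le (1 - d^{-t})^{\ell} \le
e^{-\ell/d^{t}}$. If $t \in o(\ln \ell) = o(\ln m)$, we have $\Pr(X = 0) = o(1)$, which proves
$\Pr(E_1) = \Pr(X \ge 1) = 1 - \Pr(X = 0) = 1 - o(1)$.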

Now we turn our attention to the K items $y_1, \ldots, y_K$. We have

$$\Pr(E_2) = 1 - \prod_{i=1}^{K} \Pr\left(f_{\mathcal{S}}(y_i) \ge p_K\right) \ge 1 - \left(\frac{1}{2}\right)^{K},$$

where the last inequality follows from the fact that $p_K$ is the expected frequency
of $y_i$. Thus, $\Pr(E_1 \cap E_2) \ge 1 - o(1) - (1/2)^K$ and the theorem follows. □



4 Algorithm for approximating the top-K frequent
itemsets

We now describe an efficient algorithm which discovers an $\varepsilon$-approximation to
$\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$ by mining progressively larger samples of the dataset $\mathcal{D}$ until
the sample size established in Theorem 1 is reached, or a certain stopping
condition is met. When the algorithm stops, it returns, as output, the set
$\mathrm{TOPK}(\mathcal{S}^*,\mathcal{I},K,w)$, where $\mathcal{S}^*$ is the last processed sample. For $j \ge 0$, define




Let also $j_{\max} \ge 0$ be the smallest index such that

$$t_{j_{\max}} \ge \min\left\{|\mathcal{D}|,\ (2/\varepsilon^2)\ln\left((2m + K(m-K))/\delta\right)\right\}.$$

The algorithm performs a sequence of phases. Specifically, in Phase j, for
$j \ge 0$ and $j < j_{\max}$, the algorithm processes a random sample of $t_j$ transactions.

^1 For notational convenience, we replace $t(\ell)$ with t in the formulas.



A different sample schedule can be used, provided that the size of the sample
processed at Phase j is at least $t_j$, and the results we present would still hold.
In practice it may be more efficient to use a geometrical progression of sample
sizes, starting at $t_0$ defined as above.

In Phase $j_{\max}$, if $t_{j_{\max}} \ge |\mathcal{D}|$ the algorithm processes $\mathcal{D}$ to extract
$\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$, otherwise it considers a random sample of $t_{j_{\max}}$ transactions.
The algorithm stops when $j = j_{\max}$, or $j < j_{\max}$ and a suitable stopping
condition (specified below) holds.

Consider Phase j and let S be the random sample of size tj processed in the 
phase. Define = K {§y . For i > 0, define also Sj{i) = [{2aj)'^'+^^^ /2}, and 

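$S_j(i) = \sum_{\ell=0}^{i} s_j(\ell)$. For notational convenience, we assume $S_j(-1) = 0$ and use
$h(j)$ as the largest index such that $S_j(h(j) - 1) + 1 \le m$. Consider an ordering
of the itemsets of $U(\mathcal{I},w)$ by decreasing frequency w.r.t. $\mathcal{S}$, and let $f_{\mathcal{S}}^{(\ell)}$ denote
the frequency in $\mathcal{S}$ of the $\ell$-th itemset of $U(\mathcal{I},w)$ in this ordering. The stopping
condition for Phase j is

$$f_{\mathcal{S}}^{(K)} - f_{\mathcal{S}}^{(S_j(i-1)+1)} \ge (i+1)\varepsilon \quad \text{for } 1 \le i \le h(j). \tag{3}$$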

Pseudocode for the algorithm is given in Algorithm 1, where the function
StoppingConditionIsSatisfied checks whether the above stopping condition
holds. The efficient implementation of this function is discussed in Section 5.
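The control flow of the progressive approach can be sketched in Python as follows. This is a skeleton under stated assumptions, not the algorithm analyzed in this section: the sample sizes follow a geometric progression with hypothetical parameters `first_size` and `growth` rather than the sizes $t_j$, and the mining routine `mine` and the predicate `stopping_condition` are supplied by the caller as placeholders. The guarantees proved below apply only when each sample is at least as large as the corresponding $t_j$ and the predicate implements condition (3).

```python
import math
import random


def progressive_sample_mining(dataset, bound, mine, stopping_condition,
                              first_size, growth=2.0, seed=None):
    """Mine increasingly larger random samples (drawn with replacement)
    until the stopping condition holds or the general bound is reached.

    dataset            : list of transactions
    bound              : sample size that is always sufficient (Theorem 1)
    mine               : callable(sample) -> result (placeholder)
    stopping_condition : callable(sample) -> bool (placeholder)
    Returns (result, number_of_transactions_processed).
    """
    rng = random.Random(seed)
    limit = min(bound, len(dataset))
    size = first_size
    while size < limit:
        sample = rng.choices(dataset, k=size)  # sampling with replacement
        if stopping_condition(sample):
            return mine(sample), size
        size = math.ceil(size * growth)
    if bound < len(dataset):
        # Final phase: fall back to the general upper bound.
        return mine(rng.choices(dataset, k=bound)), bound
    # The bound is not smaller than the dataset: mine the dataset itself.
    return mine(dataset), len(dataset)
```

A geometric schedule keeps the total work within a constant factor of the last sample processed, which is one reason it may be preferable in practice.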



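Algorithm 1: Pseudocode of the algorithm

input : dataset $\mathcal{D}$, integers K, w, reals $\varepsilon, \delta$ : $0 < \varepsilon, \delta < 1$.
output: A collection of ordered pairs $(x, f)$ which is an $\varepsilon$-approximation to
        $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$ with probability at least $1 - \delta$.

$m \leftarrow |U(\mathcal{I},w)|$; bound $\leftarrow \min\{|\mathcal{D}|, (2/\varepsilon^2)\ln((2m + K(m-K))/\delta)\}$
$j \leftarrow 0$; $j_{\max} \leftarrow \arg\min_i \{ t_i \ge \text{bound} \}$
while $j < j_{\max}$ do
    $t_j \leftarrow$ size of the sample for Phase j (as defined above)
    $\mathcal{S} \leftarrow$ random sample of size $t_j$ from $\mathcal{D}$
    if StoppingConditionIsSatisfied then return $\mathrm{TOPK}(\mathcal{S},\mathcal{I},K,w)$
    $j \leftarrow j + 1$
end
if bound $< |\mathcal{D}|$ then $\mathcal{S} \leftarrow$ a random sample of size bound from $\mathcal{D}$
else $\mathcal{S} \leftarrow \mathcal{D}$
return $\mathrm{TOPK}(\mathcal{S},\mathcal{I},K,w)$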



We now show that, with probability at least $1 - \delta$, the set returned by the
above algorithm is an $\varepsilon$-approximation to $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$. For Phase j of the
algorithm we define $B_j(i)$, with $0 \le i \le h(j)$, as the set of $s_j(i)$ itemsets of
$U(\mathcal{I},w)$ whose rank in the canonical ordering (w.r.t. the original dataset $\mathcal{D}$) is
in the interval $[S_j(i-1) + 1, S_j(i)]$.



Lemma 1. The following property holds with probability at least $1 - \delta$: for every
Phase j of the algorithm, for every $0 \le i \le h(j)$, and for every itemset $x \in B_j(i)$:

$$|f_{\mathcal{S}}(x) - f_{\mathcal{D}}(x)| < (i+1)\frac{\varepsilon}{2},$$

where $\mathcal{S}$ is the sample processed in Phase j.

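Proof. Let us focus on an arbitrary Phase j. From Fact 1, Relation (2), we have
that for any $x \in B_j(i)$

$$\Pr\left(|f_{\mathcal{S}}(x) - f_{\mathcal{D}}(x)| \ge (i+1)\frac{\varepsilon}{2}\right) \le 2e^{-t_j (i+1)^2 \varepsilon^2 / 8}.$$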



Hence the probability that there exists an itemset x (belonging to any $B_j(i)$)
for which the stated bound does not hold is upper bounded by:

hU) h(j) 2 hU) . . X (z+l)2 



i=0 



2i+i 



The lemma follows by applying the union bound over all phases (i.e., j = 
0,1,...). □ 

The following theorem establishes the desired probabilistic guarantee on the 
correctness of the algorithm. 

Theorem 3. The algorithm returns an $\varepsilon$-approximation to $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$
with probability at least $1 - \delta$.

Proof. We consider two cases, depending on when the algorithm stops. If the algorithm
stops at Phase $j = j_{\max}$, then the output is correct since it coincides with
the set $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$, if $t_{j_{\max}} \ge |\mathcal{D}|$, or, otherwise, it is an $\varepsilon$-approximation
to $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$ with probability at least $1 - \delta$ as shown by Theorem 1.
Suppose instead that the algorithm stops at an earlier phase $j < j_{\max}$ because
the stopping condition for Phase j is met, and let $\mathcal{S}$ denote the sample used in
this phase. By Lemma 1, for every $0 \le i \le h(j)$, and for every itemset $x \in B_j(i)$,
we have $|f_{\mathcal{S}}(x) - f_{\mathcal{D}}(x)| < (i+1)\varepsilon/2$. Let $W = \mathrm{TOPK}(\mathcal{S},\mathcal{I},K,w)$ be the set
returned by the algorithm. We now prove that W satisfies properties P1, P2,
and P3 of Definition 1.

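We first show that for each $(x, f_{\mathcal{S}}(x)) \in W$, we have that $x \in B_j(0)$. By
contradiction, assume that for some $(x, f_{\mathcal{S}}(x)) \in W$, $x \in B_j(i)$, for some $i > 0$.
Hence, $f_{\mathcal{D}}(x) \le f_{\mathcal{D}}^{(S_j(i-1)+1)}$ and

$$f_{\mathcal{S}}^{(K)} \le f_{\mathcal{S}}(x) < f_{\mathcal{D}}(x) + (i+1)\frac{\varepsilon}{2} \le f_{\mathcal{D}}^{(S_j(i-1)+1)} + (i+1)\frac{\varepsilon}{2}. \tag{4}$$

Observe that all itemsets whose rank in the canonical ordering (w.r.t. $\mathcal{D}$) is
not larger than $S_j(i-1) + 1$ belong to sets $B_j(\ell)$ with $\ell \le i$. By Lemma 1, for
each such itemset z, we have that

$$f_{\mathcal{S}}(z) > f_{\mathcal{D}}(z) - (i+1)\frac{\varepsilon}{2} \ge f_{\mathcal{D}}^{(S_j(i-1)+1)} - (i+1)\frac{\varepsilon}{2}.$$

Hence, since there are $S_j(i-1) + 1$ of these itemsets, it follows that

$$f_{\mathcal{S}}^{(S_j(i-1)+1)} > f_{\mathcal{D}}^{(S_j(i-1)+1)} - (i+1)\frac{\varepsilon}{2}. \tag{5}$$

By combining Equations 4 and 5 we obtain that $f_{\mathcal{S}}^{(K)} - f_{\mathcal{S}}^{(S_j(i-1)+1)} < (i+1)\varepsilon$,
which contradicts the stopping condition (3). Thus, all itemsets occurring in W
belong to $B_j(0)$. This fact, together with the inequality stated in Lemma 1 for
the itemsets of $B_j(0)$, establishes Property P3.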

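Now, if we consider any of the first K itemsets of $U(\mathcal{I},w)$ in the canonical
ordering, say $x_\ell$, for some $1 \le \ell \le K$, which belongs to $B_j(0)$ by construction, we
have that $f_{\mathcal{S}}(x_\ell) > f_{\mathcal{D}}(x_\ell) - \frac{\varepsilon}{2} \ge f_{\mathcal{D}}^{(K)} - \frac{\varepsilon}{2}$. Hence, $f_{\mathcal{S}}^{(K)} > f_{\mathcal{D}}^{(K)} - \frac{\varepsilon}{2}$. Therefore,
for each $(x, f_{\mathcal{S}}(x)) \in W$ we have

$$f_{\mathcal{D}}(x) > f_{\mathcal{S}}(x) - \frac{\varepsilon}{2} \ge f_{\mathcal{S}}^{(K)} - \frac{\varepsilon}{2} > f_{\mathcal{D}}^{(K)} - \varepsilon,$$

which establishes Property P1. Finally, in order to establish Property P2, we
observe that W must contain a pair $(z, f_{\mathcal{S}}(z))$ such that $f_{\mathcal{D}}(z) \le f_{\mathcal{D}}^{(K)}$. As argued
before, $z \in B_j(0)$, hence

$$f_{\mathcal{S}}^{(K)} \le f_{\mathcal{S}}(z) < f_{\mathcal{D}}(z) + \frac{\varepsilon}{2} \le f_{\mathcal{D}}^{(K)} + \frac{\varepsilon}{2}. \tag{6}$$

Consider an itemset $y \in U(\mathcal{I},w)$ such that $(y, f_{\mathcal{S}}(y)) \notin W$. If $y \in B_j(i)$ with
$i > 0$ then by definition of $B_j(i)$ its real frequency is at most $f_{\mathcal{D}}^{(K)}$, hence it
cannot be greater than or equal to $f_{\mathcal{D}}^{(K)} + \varepsilon$. If instead $y \in B_j(0)$ we have

$$f_{\mathcal{D}}(y) < f_{\mathcal{S}}(y) + \frac{\varepsilon}{2} < f_{\mathcal{S}}^{(K)} + \frac{\varepsilon}{2} < f_{\mathcal{D}}^{(K)} + \varepsilon,$$

where the last inequality follows from Equation 6. Thus, Property P2 is established. □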



5 Efficient implementation with count-min filter 

A straightforward implementation of function StoppingConditionIsSatisfied
presented in Section 4 requires $m = |U(\mathcal{I},w)|$ counters to store the observed
frequencies of all itemsets in order to evaluate the stopping condition (3). We now
describe an efficient implementation which uses count-min filters, a variation of
Bloom filters, to save space. For a dataset with $O(1)$ transaction size the use of
count-min filters reduces the space requirements from m to $O(\log m)$ counters.

A count-min filter B consists of c counters, and uses $k_B$ hash functions.
The counters are split into $k_B$ disjoint groups of size $c/k_B$ (we assume that $k_B$
divides c evenly). The $k_B$ hash functions map itemsets into counters, so each
hash function $H_i$, $1 \le i \le k_B$, is a map from the set $U(\mathcal{I},w)$ to the integers
in the range $[(i-1)c/k_B, ic/k_B - 1]$. A more detailed description of count-min
filters and their properties can be found in [9, Section 13.4]. Given a sample
$\mathcal{S}$, we use a count-min filter B to approximate the frequencies of itemsets in $\mathcal{S}$.
Initially, all counters are set to 0, then, for each transaction $\tau \in \mathcal{S}$ and each
itemset $x \in U(\mathcal{I},w)$ appearing in $\tau$, we increment by one the $k_B$ counters $H_i(x)$
associated with x.
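The data structure just described can be sketched in Python as follows. The class name `CountMinFilter`, the helper `populate`, and in particular the hash family (salted BLAKE2b digests from the standard `hashlib` module) are assumptions of this example; the text does not prescribe a specific hash family. Groups are indexed from 0 in the code, so group i covers counters $[i \cdot c/k_B, (i+1) \cdot c/k_B - 1]$.

```python
import hashlib
from itertools import combinations


class CountMinFilter:
    """c counters split into k_b disjoint groups; hash i writes only to group i."""

    def __init__(self, c, k_b):
        if c <= 0 or k_b <= 0 or c % k_b != 0:
            raise ValueError("k_b must be positive and divide c evenly")
        self.k_b = k_b
        self.group_size = c // k_b
        self.counters = [0] * c

    def positions(self, itemset):
        # Hypothetical hash family: BLAKE2b salted with the group index.
        key = repr(tuple(sorted(itemset))).encode("utf-8")
        result = []
        for i in range(self.k_b):
            digest = hashlib.blake2b(key, digest_size=8,
                                     salt=i.to_bytes(8, "little")).digest()
            offset = int.from_bytes(digest, "little") % self.group_size
            result.append(i * self.group_size + offset)
        return result

    def add(self, itemset):
        for p in self.positions(itemset):
            self.counters[p] += 1

    def support(self, itemset):
        # Minimum of the k_b counters; collisions can only inflate it,
        # so it is never smaller than the true support in the sample.
        return min(self.counters[p] for p in self.positions(itemset))


def populate(cm, sample, w):
    """Insert every itemset of size <= w of every transaction of the sample."""
    for transaction in sample:
        items = sorted(set(transaction))
        for size in range(1, min(w, len(items)) + 1):
            for itemset in combinations(items, size):
                cm.add(itemset)
```

Dividing `support(x)` by the number of transactions in the sample gives the corresponding frequency estimate, which can overestimate but never underestimate the frequency of x in the sample.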

We now introduce some definitions and some results on count-min filters
which we will use later in the analysis. The count-min support of an itemset x
is the value of the minimum of the $k_B$ counters associated with x in B, and is
denoted with $\sigma_B(x)$. The count-min frequency of x is $f_B(x) = \sigma_B(x)/|\mathcal{S}|$. (In the
notation for count-min support and count-min frequency we omit any reference
to $\mathcal{S}$ because the set of transactions on which the count-min filter is built will
be clear from the context.) We denote the sum of the number of itemsets from



U (X, w) in the transactions of S as Cs = ^ ^ 

res i=Q 

The following theorem (proof omitted due to space constraints) shows that 
we can obtain a good approximation of the frequencies of the itemsets using a 
count-min filter. 



Theorem 4. Given 5b > 0, £b > 0, and a sample S, let ec = 
5c = Sb/tu. If B is a count-min filter of parameters Ub = 
then Pr(3a;|/B(a;) > fs{x)+eB) < Sb- 



Cs 

and c = 



While the count-min filter is useful in reducing the space required to approximate
the frequencies of itemsets in the sample, it is not trivial to check
the stopping condition without explicitly querying the filter for the count-min
frequency of every itemset in $U(\mathcal{I},w)$ (not only of those that appear in the sample),
an operation that can be computationally too expensive. Our algorithm
will make use of an approximation of the distribution of the frequencies in the
sample of the itemsets, built using only the count-min frequencies that appear
in the count-min filter, without generating all the itemsets. The algorithm uses
the same parameters $\mathcal{D}$, K, $\varepsilon$, and $\delta$ as the algorithm of Section 4.

Let $\delta_1, \delta_2 > 0$ be such that $(1 - \delta_1)(1 - \delta_2) = 1 - \delta$. We define $t_j$ similarly to
Section 4, using $\delta_1$ instead of $\delta$. Let $j_{\max}$, $a_j$, $s_j(i)$, $S_j(i)$, and $h(j)$ be defined as
in the algorithm of Section 4. The algorithm performs a sequence of phases and
stops when $j = j_{\max}$, or $j < j_{\max}$ and a suitable stopping condition (Equation (7)
specified below) holds.

Let $\mathcal{S}$ be the sample analyzed by the algorithm at phase j. At each phase
the algorithm will use a count-min filter B with parameters c, $k_B$ tuned so that
$\Pr(\exists x \mid f_B(x) > f_{\mathcal{S}}(x) + \varepsilon_B) < \delta_2$ (see Theorem 4). Note that $\varepsilon_B$ is not defined
by the user. First, the algorithm obtains $\mathrm{TOPK}(\mathcal{S},\mathcal{I},K,w)$. Then, we scan the
sample and populate the count-min filter B as described before. With a second
scan of the sample, the algorithm computes an approximation $\tilde{f}$ to the distribution
of frequencies of itemsets in the samples, so that for all j, if $\tilde{f}^{(j)}$ is the
frequency of the j-th most frequent itemset using this approximation, $\tilde{f}^{(j)} \ge f_{\mathcal{S}}^{(j)}$
holds. For each itemset $x \in U(\mathcal{I},w)$ appearing in $\mathcal{S}$, let $c_x$ be the counter of B
with minimum value among those associated to x, and let, for each counter $\ell$ of
B, $I_\ell = \{x \in U(\mathcal{I},w) : c_x = \ell\}$. The approximation $\tilde{f}$ is computed as follows:
for each counter $\ell$ of B, another counter $s_\ell$ is created, with initial value zero.
The algorithm scans $\mathcal{S}$ and, for each transaction $\tau$ in $\mathcal{S}$ and each itemset x of
length up to w appearing in $\tau$, the algorithm increases $s_{c_x}$ by one. Once the scan
is terminated, the value of $s_\ell$ will be $v_\ell = \sum_{x \in I_\ell} \sigma_{\mathcal{S}}(x)$. Since we built B so that,
with probability at least $1 - \delta_2$, for each $x \in I_\ell$ we have $\sigma_{\mathcal{S}}(x) \ge \sigma_B(x) - \varepsilon_B|\mathcal{S}|$



then, if Theorem 4 holds, the value rg = 



^ 

o-b(x)-£b|5| 



is an upper bound to \Ie \ . 



For each counter i of B let x be an itemset in li, then wc define fe = ^-^p- To 
obtain an ordering for the approximate frequencies of all itemsets in U{I, w), we 
need to sort only the p frequencies of the p counters in B, since in our approxi- 
mation there will be re itemsets with frequency fe. Let / be the frequency at 
the i-th position in this order, for 1 < i < m. By definition of /'•*■' and le, we 
have that /^'^ > The stopping condition for phase j is then 

_ j(5,(i-i)+i) > (• ^ for 1 < i < h{j) . (7) 

Note that the choice of $\varepsilon_B$ influences the stopping condition, since the accuracy
of $\tilde{f}$ depends on $\varepsilon_B$. When the algorithm stops, it returns the set of top-K
frequent itemsets and their respective frequencies with respect to the last processed
sample. The following theorem (proof omitted due to space constraints)
easily follows from the considerations above.

Theorem 5. The output of the count-min filter based algorithm is an $\varepsilon$-approximation
to $\mathrm{TOPK}(\mathcal{D},\mathcal{I},K,w)$ with probability at least $1 - \delta$.

6 Evaluation 

In this section, we provide evidence of the effectiveness of our results. Specifically, 
in Subsection 6.1 we evaluate experimentally the quality of the approximation 
of the top-K frequent itemsets obtained by mining small samples with the sizes 
derived in the previous sections. In Subsection 6.2 we provide both analytical 
and experimental evidence that the stopping condition used by the algorithm 
presented in Section 4 is effective for certain datasets. 

6.1 Evaluation of the quality of the output 

We first conduct an analysis of the "quality" of the output set obtained either
by mining a sample of a dataset with a size set by the bound presented in
Theorem 1, or by running the algorithm of Section 4. We used the real datasets
kosarak and webdocs, and the artificial dataset T10I4D100K from the FIMI
repository^2, whose main characteristics are summarized in Figure 1.

Dataset       #Items      Avg. Trans. Length   #Transactions
T10I4D100K    1,000       10.1                 100,000
kosarak       41,270      8.1                  990,002
webdocs       5,267,656   177                  1,692,082

Fig. 1. Datasets characteristics

¹ http://fimi.cs.helsinki.fi/data/.

Each of these datasets has a different distribution of the frequencies of the
items. We used several values for K (1, 2, 5, 10, 100, 1000) and the values 1, 2,
and 3 for w. In all of our experiments, $\delta$ was fixed to 0.1 and $\varepsilon$ to 0.02. For
each combination (dataset, K, w), we mined $\mathrm{TOPK}(S, \mathcal{I}, K, w)$ from 100 random
samples S of the size derived from Theorem 1, and ran the algorithm of Section 4
100 times. (We did not run the algorithm for webdocs with w = 2, 3 due to the
inefficiency of the current implementation of the algorithm, which does not use
the count-min filter.) In all cases we considered, the size suggested by the
theoretical bound was considerably smaller than the size of the dataset. In
particular, for kosarak the bound suggested a sample size approximately 20% of
the size of the dataset, while for webdocs it was between 5% and 10%, and for
T10I4D100K around 40%, because this last dataset has a smaller number of
transactions.
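
The sample-mining step of this setup can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, sampling with replacement is an assumption, and the exhaustive enumeration of itemsets is only practical for small w.

```python
import random
from collections import Counter
from itertools import combinations


def top_k_itemsets(transactions, k, w):
    """Return {itemset: frequency} for all itemsets of size <= w whose
    frequency is at least that of the k-th most frequent one."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, w + 1):
            counts.update(combinations(items, size))
    if not counts:
        return {}
    n = len(transactions)
    ranked = counts.most_common()  # sorted by decreasing count
    threshold = ranked[min(k, len(ranked)) - 1][1]
    return {x: c / n for x, c in ranked if c >= threshold}


def mine_sample(dataset, sample_size, k, w, rng=random):
    # Assumption of this sketch: transactions are drawn uniformly
    # at random with replacement.
    sample = rng.choices(dataset, k=sample_size)
    return top_k_itemsets(sample, k, w)
```

Ties at the K-th position are all returned, which matches the definition of the top-K frequent itemsets as those with frequency at least that of the K-th most frequent one.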

In all of the runs, for any tested combination of parameters, the output
was an $\varepsilon$-approximation to $\mathrm{TOPK}(\mathcal{D}, \mathcal{I}, K, w)$. This should be compared to the
$(1 - \delta) = 0.9$ probability of obtaining an $\varepsilon$-approximation guaranteed by the
theoretical results. Also, in all cases the output included at least 95% of the
actual top-K frequent itemsets. (Note that the definition of $\varepsilon$-approximation
gives no guarantee on the fraction of actual top-K frequent itemsets returned.)
In fact, we observed that the actual top-K frequent itemsets discovered from
the sample were usually many more than those with actual frequency greater
than $f_{\mathcal{D}}^{(K)} + \varepsilon$, which are guaranteed to be included in the output by definition
of $\varepsilon$-approximation. Most of the time (85% of the runs), the output contained
exactly all of the actual top-K frequent itemsets. For w = 2, 3, frequent itemsets
of size greater than one were also correctly identified by mining the sample. In
particular, for kosarak and w = 2, we always correctly identified all the frequent
itemsets of size 2 at K = 5 (3 such itemsets), K = 10 (5 itemsets), and K = 100
(65 itemsets). For K = 1000 we always identified at least 750 such itemsets of size
2, out of 765. For w = 3 we were always able to identify all itemsets of length 2 (4
itemsets) and length 3 (1 itemset) when K = 10. For T10I4D100K, no frequent
itemsets of size greater than 1 existed for the tested values of K. Finally, as far
as the reported frequency is concerned, in all of the runs and for every itemset,
the error between the reported frequency in the output and the real frequency of
the itemset was much smaller than $\varepsilon$, usually between $\varepsilon/10$ and $\varepsilon/5$. The higher
accuracy of the observed output with respect to what is promised by the theoretical
analysis is explained by the fact that the latter relies on several approximations
which weaken the bounds.
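
A check of this kind can be sketched as below. The formal definition of approximation is given earlier in the paper and is not repeated in this section, so the conditions used here are assumptions: every itemset with true frequency above the K-th frequency plus eps must be present, no itemset with true frequency below the K-th frequency minus eps may be present, and every reported frequency must be within eps of the true one. The function name and argument layout are hypothetical.

```python
def is_eps_approximation(output, true_freq, k, eps):
    """output: {itemset: reported frequency}.
    true_freq: {itemset: frequency in the full dataset} for all candidates."""
    ranked = sorted(true_freq.values(), reverse=True)
    # Frequency of the K-th most frequent itemset in the full dataset.
    f_k = ranked[min(k, len(ranked)) - 1]
    for itemset in set(true_freq) | set(output):
        f = true_freq.get(itemset, 0.0)
        if f > f_k + eps and itemset not in output:
            return False  # a very frequent itemset is missing
        if f < f_k - eps and itemset in output:
            return False  # a very infrequent itemset was returned
        if itemset in output and abs(output[itemset] - f) > eps:
            return False  # the reported frequency is too far off
    return True
```

Itemsets whose true frequency lies within eps of the K-th frequency may be either present or absent, which is why an approximation can be valid and still miss some of the actual top-K itemsets.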

6.2 Effectiveness of the stopping condition 

Below, we provide both analytical and experimental evidence that the stopping
condition used by the algorithm presented in Section 4 is effective in the sense
that, for certain datasets, the algorithm stops after mining a sample of size
smaller than the upper bound of Theorem 1.

Analytical evidence. Consider using the algorithm presented in Section 4 for
mining an $\varepsilon$-approximation to $\mathrm{TOPK}(\mathcal{D}, \mathcal{I}, K, w)$ for a dataset $\mathcal{D}$ over a set
$\mathcal{I}$ of n items with confidence at least $1 - \delta$. Recall that the algorithm probes
increasing sample sizes $t_j$, with $j \ge 0$, until the stopping condition is met or
$j = j_{\max}$, where the last sample size $t_{j_{\max}}$ is the minimum between the dataset
size and the upper bound given in Theorem 1. For convenience, fix $K = \sum_{i=1}^{w} \binom{\ell}{i}$
for some integer $\ell \ge w$, and choose the parameters n, $\varepsilon$ and w in such a way as to
guarantee that $n > \ell$, $j_{\max} > 0$, and $p_K = (h(j) + 1)\varepsilon + (\varepsilon/2) + (1/t_j) < 1$, for
some j with $0 \le j < j_{\max}$. It can be easily shown that meaningful configurations
of the parameters for which these conditions are satisfied exist (more details will
be provided in the full paper). Fix one such value j.

Let $\mathcal{I} = \{x_1, x_2, \ldots, x_n\}$ and define $\tau_0 = \{x_1, x_2, \ldots, x_\ell\}$, and $\tau_i = \{x_i\}$, for
$\ell < i \le n$. Consider a dataset $\mathcal{D}$ consisting of N copies of $\tau_0$ and one copy of
$\tau_i = \{x_i\}$ for each $\ell < i \le n$. Thus $\mathcal{D}$ contains a total of $N + n - \ell$ transactions.
We allow N to grow arbitrarily large and assume it is large enough to make
$N + n - \ell > t_j$ and to make the frequency of each of the K itemsets included in
$\tau_0$ greater than $p_K$. We have:
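
The construction can be made concrete with a small sketch. Items are represented by the integers 1 to n, and the names `build_dataset`, `k_for`, and `frequency` are hypothetical helpers introduced only for this illustration.

```python
from math import comb


def build_dataset(n, ell, copies):
    """`copies` copies of tau_0 = {1, ..., ell}, plus one singleton
    transaction {i} for each ell < i <= n."""
    tau_0 = tuple(range(1, ell + 1))
    return [tau_0] * copies + [(i,) for i in range(ell + 1, n + 1)]


def k_for(ell, w):
    # Number of nonempty itemsets of size at most w contained in tau_0.
    return sum(comb(ell, i) for i in range(1, w + 1))


def frequency(dataset, itemset):
    # Fraction of transactions that contain every item of `itemset`.
    s = set(itemset)
    return sum(s <= set(t) for t in dataset) / len(dataset)
```

For example, with n = 6, ell = 3 and 10 copies, the dataset has 13 transactions; each of the `k_for(3, 2)` = 6 itemsets of size at most 2 inside tau_0 has frequency 10/13, while every singleton outside tau_0 has frequency 1/13. Increasing the number of copies widens this gap, which is what the argument below relies on.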

Theorem 6. For a dataset $\mathcal{D}$ built as described above, the algorithm will stop
at round j with probability at least $1 - \delta - o(1)$.

Proof. Suppose we are at round j of the algorithm and that Lemma 1 holds,
which happens with probability at least $1 - \delta$. Moreover, assume that none of
the $m - K$ itemsets not appearing in transaction $\tau_0$ has a frequency in
the sample greater than $1/t_j$. The probability of this second event is $1 - o(1)$ if N
is large enough.

From Lemma 1, we have that for any itemset $x \in B_j(i)$, $0 \le i \le h(j)$:

$f_S(x) \ge f_{\mathcal{D}}(x) - (i + 1)\frac{\varepsilon}{2} \ge f_{\mathcal{D}}(x) - (h(j) + 1)\frac{\varepsilon}{2}.$

Then, the K itemsets in transaction $\tau_0$ have frequency in the sample greater
than $(h(j) + 1)\frac{\varepsilon}{2} + \frac{\varepsilon}{2} + \frac{1}{t_j} > \frac{1}{t_j}$ and are thus the top-K frequent itemsets in the
sample. This means they belong to $B_j(0)$, and then $f_S^{(K)} > p_K - \frac{\varepsilon}{2}$. Hence we
have $f_S^{(K)} - f_S^{(i)} > (h(j) + 1)\varepsilon$, $\forall i$, $K < i \le m$, and the theorem follows. □
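
The last step of the proof can be spelled out by substituting the definition of $p_K$. The notation $f_S^{(i)}$ for the i-th largest frequency in the sample is assumed here, and the bound $f_S^{(i)} \le 1/t_j$ for $i > K$ is the second event assumed at the start of the proof.

```latex
\begin{align*}
f_S^{(K)} - f_S^{(i)}
  &> \left(p_K - \tfrac{\varepsilon}{2}\right) - \tfrac{1}{t_j} \\
  &= (h(j)+1)\varepsilon + \tfrac{\varepsilon}{2} + \tfrac{1}{t_j}
     - \tfrac{\varepsilon}{2} - \tfrac{1}{t_j} \\
  &= (h(j)+1)\varepsilon , \qquad K < i \le m .
\end{align*}
```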

Experimental evidence. The ability of the stopping condition to halt the
sampling schedule before the sample size established by Theorem 1 is reached was
also observed when running the algorithm from Section 4 on kosarak with the
same configurations of parameters K, w, $\varepsilon$ and $\delta$ described in the previous
subsection. For this dataset, when w = 1 the algorithm always terminated when
the sample size was equal to the theoretical bound, while for w = 2, 3 it
sometimes stopped earlier. Fig. 2(a) and Fig. 2(b) show a comparison between the
theoretical bound and the sample size at which the algorithm terminated in
our experiments because the stopping condition was satisfied, as a function of K.
We observe that the gap between the stopping sample size and the theoretical
bound increases with w, which suggests that the stopping condition employed by
our algorithm becomes more effective when the number of potential candidate
itemsets (i.e., the size of the set U(I, w)) increases.

(a) w = 2    (b) w = 3

Fig. 2. Experimental evaluation of the effectiveness of the stopping condition for
kosarak. The dotted line represents the theoretical bound from Theorem 1, the solid
line the size at which the algorithm stopped.

7 Conclusions 

We studied the extraction of the top-K frequent itemsets of bounded size from 
random samples of a dataset of transactions. We defined a reasonable approxi- 
mation of the task and explored the tradeoff between the size of the sample and 
the accuracy of the approximation. In particular, we proved a bound on the sam- 
ple size sufficient to achieve a given accuracy of the approximation with a given 
confidence, and we showed that, under certain constraints on the parameters, 
the bound is tight within constant factors. To the best of our knowledge, this is
the first tight relation between sample size and accuracy of the approximation
for mining top-K frequent itemsets. We also proposed a progressive sampling 
algorithm that, in some cases, is able to ensure similar accuracy and confidence 
while mining smaller samples. For this algorithm, whose efficient implementation 
is challenging, we proposed an optimization based on count-min filters, a varia- 
tion of Bloom filters. The effectiveness of our results has been assessed on both 
artificial and real benchmark datasets. Future research could aim at obtaining 
tight upper and lower bounds on the sample size required to ensure a given
accuracy and confidence in all cases, at characterizing the datasets and parameter
configurations for which the progressive sampling algorithm becomes profitable, 
and at engineering an efficient implementation of this algorithm based on the 
count-min filter. 